In this data exploration we are looking at the US Real Estate market with use of a dataset from kaggle published by Ahmed Shahriar Sakib. It contains over 2.2 Million Real Estate listings broken down to State, Size, Price (among other factors). (Source: https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data)
Data import
Instructions
Download and import your dataset. Assign each variable appropriate/correct data types. Discuss/explain your choice for each variable (groups) in the main text.
map_size <- data %>%group_by(state) %>%summarise(avg_size =mean(house_size, na.rm =TRUE), .groups ="drop") %>%inner_join(valid_states, by =c("state"="state_name"))plot_ly( map_size,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~avg_size,text =~paste0(state, "<br>Avg Size: ", round(avg_size), " sqft"),colorscale ="Greens",colorbar =list(title ="Avg Size (sqft)")) %>% plotly::layout(title =list(text ="Average House Size by U.S. State"),geo =list(scope ="usa", projection =list(type ="albers usa")) )
USA Map Extremfälle Price (Karte)
Code
map_extremes <- data %>%group_by(state) %>%summarise(min_price =suppressWarnings(min(price, na.rm =TRUE)),max_price =suppressWarnings(max(price, na.rm =TRUE)),.groups ="drop" ) %>%mutate(range_price = max_price - min_price) %>%inner_join(valid_states, by =c("state"="state_name"))plot_ly( map_extremes,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~range_price,text =~paste0( state,"<br>Min: $", formatC(min_price, big.mark =",", format ="f", digits =0),"<br>Max: $", formatC(max_price, big.mark =",", format ="f", digits =0) ),colorscale ="Reds",colorbar =list(title ="Price Range ($)")) %>% plotly::layout(title =list(text ="Price Extrem Values by U.S. State (Max − Min)"),geo =list(scope ="usa", projection =list(type ="albers usa")) )
House Size vs Lot Size
Multiple Regression mit price als Zielvariable, Prädiktoren bed, bath, house_size, acre_lot.
Anzahl aktiver Listings pro State
Summary
Instructions
Summarise your finding.
Source Code
---title: "Data Exploration Gabriel Hamulic"subtitle: "Dataset: US Real Estate Dataset"author: "Hamulic, Gabriel"date: todayembed-resources: trueformat: html: output-file: US-Real-Estate_Gabriel_Hamulic.html #output-ext: "html.html" toc: true toc-location: right code-link: true code-tools: true #df-print: kable theme: light: flatly dark: darkly #echo: fenced pdf: output-file: US-Real-Estate_Gabriel_Hamulic.pdf toc: true number-sections: true code-link: true df-print: tibble crossref: lof-title: "List of Figures"fig-align: centerexecute: warning: false---\listoffigures \listoftables\listoflistings::: callout-caution### InstructionsYour report must be of high quality, meaning that your report:- is visually and textually pleasing of- does not look/read/feel like a draft instead of a finished analysis- explains/discusses your findings and results in the main text, e.g., explain/discuss all figures/table in the main text- is representable such that it can show to any interested third party- uses figure/table captions/linking/reference (see example further down)- Do not show any standard printout of R-code, use for data.frame/tibbles `knitr::kable()` printing.- Do not simply print datasets (too many lines) use instead `rmarkdown::paged_table()`:::{{< pagebreak >}}# Introduction## Libraries```{r}#| code-summary: Libraries#| code-fold: truelibrary <-function(...) {suppressPackageStartupMessages(base::library(...))}library(tidyverse)library(dplyr)library(knitr)library(tidyr)library(rmarkdown)library(janitor)library(scales)library(tidytext)library(ggforce)library(GGally)```# Data## Data sourceIn this data exploration we are looking at the US Real Estate market with use of a dataset from kaggle published by Ahmed Shahriar Sakib. It contains over 2.2 Million Real Estate listings broken down to State, Size, Price (among other factors). (Source: <https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data>)## Data import::: callout-caution### InstructionsDownload and import your dataset. Assign each variable appropriate/correct data types. Discuss/explain your choice for each variable (groups) in the main text.:::```{r setup, include=FALSE}options(dplyr.print_max =10, dplyr.print_min =10)``````{r}# Import Datadata =read.csv("data/realtor-data.zip.csv")```## Data Transformation```{r}# Keep relevant Columnsdata =subset(data, select =c(status, price, bed, bath, acre_lot, city, state, house_size))``````{r}str_out <-capture.output(str(data))kable(data.frame(Output = str_out), col.names ="Structure Output")``````{r}# Assign Data Typesdata$status =as.factor(data$status)data$city =as.factor(data$city)data$state =as.factor(data$state)``````{r}str_out <-capture.output(str(data))kable(data.frame(Output = str_out), col.names ="Structure Output")``````{r}before_rows <-nrow(data)data <-na.omit(data)after_rows <-nrow(data)kable(data.frame(Description =c("Before NA removal", "After NA removal"),Rows =c(before_rows, after_rows)))```The dataset now has `r nrow(data)` observations and `r ncol(data)` variables after removing rows with missing values.```{r}data = data |>mutate(price_per_sqm = price/house_size)# Filter min and max valuesdata = data |>filter(price >10000& price <1000000000)```### Cleaned Dataset```{r}paged_table(data)``````{r}str(data)```We chose the following data types for our variables:- ## Data dictionaryThe following variables of the dataset will be looked into:- price - The price for which the item was listed on the market- status - The status if the house is already sold or still for sale- acre_lot - The size of the land / lot on which the house is located in acres- state - The state in which the house is located- house size - The size of the house in square feet# Summary statistic tablesIn this section we will cover the summary of our cleaned dataset. We will explore basic statistical values from our data.## Numeric Statistics```{r}#| label: "Numeric Statistics"#| tbl-cap: "Summary statistics of numerical variables in dataframe"#| code-fold: truedata |> janitor::clean_names() |>mutate(row =row_number() |>factor()) |>pivot_longer(cols =where(is.numeric)) |>group_by(name) |>summarize(N =n(),min =min(value),mean =mean(value),median =median(value),max =max(value),st.dev =sd(value) ) |> knitr::kable(digits =2)``````{r}#| label: "Logarithmic Visualisation"#| tbl-cap: "Visualisation of numerical variables in dataframe"#| code-fold: truedata |>clean_names() |>pivot_longer(cols =where(is.numeric)) |>ggplot(aes(x = value, fill = name)) +geom_histogram(bins =30, alpha =0.7, color ="white") +scale_x_log10(labels =label_comma()) +# 👈 echte Werte, log-Skalafacet_wrap(~ name, scales ="free_x") +theme_minimal() +labs(title ="Distribution of Numerical Variables (logarithmic scale)",x ="Value",y ="Count" ) +theme(legend.position ="none",axis.text.x =element_text(angle =25, hjust =1) )```## Nominal Statistics```{r}#| label: "Nominal Statistics"#| tbl-cap: "Top categories for factor variables with counts, proportions, and mean price"#| code-fold: truetop_n_per_var <-10nominal_summary <- data |>clean_names() |>select(where(is.factor), price) |>pivot_longer(cols =where(is.factor),names_to ="Variable",values_to ="Category") |>group_by(Variable, Category) |>summarise(Count =n(),Percent =round(100* Count /nrow(data), 2),Mean_Price =round(mean(price, na.rm =TRUE), 0),.groups ="drop" ) |>group_by(Variable) |>slice_max(order_by = Count, n = top_n_per_var, with_ties =FALSE) |>ungroup()kable( nominal_summary,caption =paste0("Top ", top_n_per_var," categories per factor variable (counts, share %, and mean price)" ),digits =2,align =c("l", "l", "r", "r", "r"))``````{r}#| label: "Visualization Nominal Statistics"#| tbl-cap: "Top categories for factor variables with counts, proportions, and mean price"#| code-fold: truenominal_summary <- nominal_summary |>group_by(Variable) |>mutate(Category = forcats::fct_reorder(Category, Count),Category =factor(Category, levels =unique(Category))) |>ungroup()# 🔹 Plot: Facets untereinander, mit eigener x-Skala und y-Skala pro Variableggplot(nominal_summary, aes(x = Count, y = Category, fill = Variable)) +geom_col(show.legend =FALSE, alpha =0.8, width =0.7) +facet_wrap(~ Variable, ncol =1, scales ="free", drop =TRUE) +scale_x_continuous(labels =label_comma()) +# 👈 Tausendertrennung, keine 1e+05theme_minimal() +labs(title ="Top Categories per Factor Variable",x ="Count",y ="Category" ) +theme(panel.spacing.y =unit(1, "lines"),strip.text =element_text(size =12, face ="bold"),axis.text.y =element_text(size =8),plot.margin =margin(5, 15, 5, 5) )```# Bivariate Analyse### Pairs Plot (all numeric variables)```{r}#| label: "Pairs Plot"#| code-fold: trueset.seed(123)data_num <- data |> janitor::clean_names() |>select(where(is.numeric)) |>slice_sample(n =3000) |>mutate(across(everything(), log1p)) # 👈 log(1+x) gegen Ausreißerp <-ggpairs( data_num,progress =FALSE,upper =list(continuous =wrap("cor", size =4, alignPercent =0.8, stars =TRUE)),lower =list(continuous =wrap("points", alpha =0.3, size =0.7)),diag =list(continuous =wrap("densityDiag", alpha =0.7)))p +theme_minimal(base_size =11) +theme(strip.text =element_text(size =8, face ="bold"),panel.grid =element_blank(),axis.text =element_text(size =8),axis.title =element_text(size =9),plot.title =element_text(face ="bold", size =14, hjust =0.5) ) +labs(title ="Scatterplot-Matrix (log-transformiert, n=3000)")```### Preis vs. Anzahl Badezimmer### Preis vs. Anzahl Schlafzimmer### Preperation for Maps```{r}#| label: "state_mapping"#| code-fold: truelibrary(dplyr)library(tibble)valid_states <-tibble(state_name =c(state.name, "District of Columbia"),state_abbr =c(state.abb, "DC"))```### USA Map – Average Price```{r}#| label: "map_avg_price"#| tbl-cap: "Average property price by U.S. state"#| code-fold: truelibrary(dplyr)library(plotly)map_price <- data |>group_by(state) %>%summarise(avg_price =mean(price, na.rm =TRUE), .groups ="drop") |>inner_join(valid_states, by =c("state"="state_name")) |>mutate(avg_price_k = avg_price /1000)plot_ly( map_price,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~avg_price_k,text =~paste0(state, "<br>Avg Price: $", round(avg_price_k, 1), "K"),colorscale ="Blues",colorbar =list(title ="Avg Price ($K)")) |> plotly::layout(title =list(text ="Average Property Price by U.S. State"),geo =list(scope ="usa", projection =list(type ="albers usa")) )```### USA Map AVG Size (Karte)```{r}#| label: "map_avg_size"#| tbl-cap: "Average house size by U.S. state"#| code-fold: truemap_size <- data %>%group_by(state) %>%summarise(avg_size =mean(house_size, na.rm =TRUE), .groups ="drop") %>%inner_join(valid_states, by =c("state"="state_name"))plot_ly( map_size,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~avg_size,text =~paste0(state, "<br>Avg Size: ", round(avg_size), " sqft"),colorscale ="Greens",colorbar =list(title ="Avg Size (sqft)")) %>% plotly::layout(title =list(text ="Average House Size by U.S. State"),geo =list(scope ="usa", projection =list(type ="albers usa")) )```### USA Map Extremfälle Price (Karte)```{r}#| label: "map_extreme_price"#| tbl-cap: "Price range (max − min) by U.S. state"#| code-fold: truemap_extremes <- data %>%group_by(state) %>%summarise(min_price =suppressWarnings(min(price, na.rm =TRUE)),max_price =suppressWarnings(max(price, na.rm =TRUE)),.groups ="drop" ) %>%mutate(range_price = max_price - min_price) %>%inner_join(valid_states, by =c("state"="state_name"))plot_ly( map_extremes,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~range_price,text =~paste0( state,"<br>Min: $", formatC(min_price, big.mark =",", format ="f", digits =0),"<br>Max: $", formatC(max_price, big.mark =",", format ="f", digits =0) ),colorscale ="Reds",colorbar =list(title ="Price Range ($)")) %>% plotly::layout(title =list(text ="Price Extrem Values by U.S. State (Max − Min)"),geo =list(scope ="usa", projection =list(type ="albers usa")) )```### House Size vs Lot Size### Multiple Regression mit price als Zielvariable, Prädiktoren bed, bath, house_size, acre_lot.### Anzahl aktiver Listings pro State# Summary::: callout-caution### InstructionsSummarise your finding.:::